Funnel plot for discovering uncertainty
Import the packages needed for this exercise.
packages = c('tidyverse','readxl','dplyr','ggplot2',
'FunnelPlotR','plotly','knitr','performance','ggpubr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
Below shows the sketch of the proposed funnel plot design.
Import the sheets ‘Orders’ and ‘Returns’.
orders <- read_xls("Superstore-2021.xls", sheet = "Orders")
returns <- read_xls("Superstore-2021.xls", sheet = "Returns")
Join the two data by ‘Order ID’ and fill in value 1 for those orders that are returned, 0 otherwise.
joined_tab <- left_join(orders, returns, by = c('Order ID'='Order ID'))
joined_tab$Returned<-ifelse(joined_tab$Returned=="Yes",1,0)
joined_tab[is.na(joined_tab)] <- 0
unique(joined_tab$Returned)
[1] 0 1
Group the data by state using group_by() function. Calculate the proportion of returns and store the values to a new column named ‘prop’.
grouped_df <- joined_tab %>%
group_by(State) %>%
summarise(OrderNum = n(),
ReturnedNum = sum(Returned))
head(as.data.frame(grouped_df))
State OrderNum ReturnedNum
1 Alabama 63 4
2 Arizona 274 72
3 Arkansas 60 0
4 California 3033 1370
5 Colorado 246 89
6 Connecticut 82 0
Convert column ‘State’ from type ‘chr’ to type ‘Factor’.
grouped_df <- grouped_df %>%
mutate_if(is.character, as.factor)
Calculate the proportion of orders returned and store the value in column ‘prop’ column of ‘grouped_df’. States with zero returns are removed.
grouped_df <- grouped_df %>%
mutate(prop =ReturnedNum/OrderNum) %>%
filter(prop >0)
A quick look at the distribution of ‘prop’ suggests that the data is not normal and skewed to right.
hist(grouped_df$prop)
To use funnel plot to find uncertainty, we need to check for normality. Below is a Q-Q plot for the column ‘prop’. As many of the points are not near the line, the data may not be normal.
ggqqplot(grouped_df$prop)
To further confirm the observation, conduct Shapiro-Wilk test to check for normality. As the p-value is smaller than 0.05, the data failed the test.
shapiro.test(grouped_df$prop)
Shapiro-Wilk normality test
data: grouped_df$prop
W = 0.92675, p-value = 0.01997
‘prop’ need to be transformed. Below code transform the data using square root transformation.
The new p-value we get now is greater than 0.05, meaning that the data now is not significantly different from normal distribution. We can now proceed to plot the graph.
grouped_df$prop_tranformed <- sqrt(grouped_df$prop)
shapiro.test(grouped_df$prop_tranformed)
Shapiro-Wilk normality test
data: grouped_df$prop_tranformed
W = 0.97635, p-value = 0.6215
Compute the lower and upper limits for 95% and 99.9% confidence interval.
grouped_df <- grouped_df %>%
mutate(prop_tranformed.se = sqrt((prop_tranformed*(1-prop_tranformed)) / (OrderNum)))
fit.mean <- weighted.mean(grouped_df$prop_tranformed, 1/grouped_df$prop_tranformed.se^2)
number.seq <- seq(1, max(grouped_df$OrderNum), 1)
number.ll95 <- fit.mean - 1.96 * sqrt((fit.mean*(1-fit.mean)) / (number.seq))
number.ul95 <- fit.mean + 1.96 * sqrt((fit.mean*(1-fit.mean)) / (number.seq))
number.ll999 <- fit.mean - 3.29 * sqrt((fit.mean*(1-fit.mean)) / (number.seq))
number.ul999 <- fit.mean + 3.29 * sqrt((fit.mean*(1-fit.mean)) / (number.seq))
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, fit.mean)
Plot a static funnel plot using ggplot().
p <- ggplot(grouped_df, aes(x = OrderNum, y = prop_tranformed)) +
geom_point(alpha=0.4, colour = 'blue',aes(text = paste0("State: ", State,"\n","Returned %: ", round(prop*100,2),"\n",
"Order Num: ", OrderNum))) +
geom_line(data = dfCI,
aes(x = number.seq,
y = number.ll95),
size = 0.4,
colour = "dark green",
linetype = "dashed") +
geom_line(data = dfCI,
aes(x = number.seq,
y = number.ul95),
size = 0.4,
colour = "dark green",
linetype = "dashed") +
geom_line(data = dfCI,
aes(x = number.seq,
y = number.ll999),
size = 0.4,
colour = "red") +
geom_line(data = dfCI,
aes(x = number.seq,
y = number.ul999),
size = 0.4,
colour = "red") +
geom_hline(data = dfCI,
aes(yintercept = fit.mean),
size = 0.4,
colour = "grey40") +
coord_cartesian(ylim=c(-0.2,0.90)) +
annotate("text", x = 1, y = -0.13, label = "95%", size = 3, colour = "dark Green") +
annotate("text", x = 4.5, y = -0.18, label = "99.9%", size = 3, colour = "Red") +
ggtitle("Returned Rate by State") +
xlab("Number of orders") +
ylab("Transformed order return rate") +
theme_light() +
theme(plot.title = element_text(size=12),
legend.position = c(0.91,0.85),
legend.title = element_text(size=7),
legend.text = element_text(size=7),
legend.background = element_rect(colour = "grey60", linetype = "dotted"),
legend.key.height = unit(0.3, "cm"))
p
Using ggplotly() of plotly r package, the funnel plot can be made interactive. In the plot below, when hovering over the points, information about state name, rate of return and order number will be displayed. An interactive funnel plot is better as user can just hover over the points to know which State is outside the boundary.
From the funnel plot, the further a point is away from the boundaries, the more likely that it is due to special cause rather than randomness.
fp_ggplotly <- ggplotly(p,tooltip = "text")
fp_ggplotly
The main point to take note is to check the normality of data. In this example, although the sample size is larger than 30, but the data failed the normality test. Therefore, it is important to always check for this assumption.
Test for normality: [http://www.sthda.com/english/wiki/normality-test-in-r]
Funnel plot description: [https://www.perceptualedge.com/articles/visual_business_intelligence/variation_and_its_discontents.pdf]
Creating funnel plot using R: [https://isss608-ay2021-22t2.netlify.app/hands-on_ex/hands-on_ex04/hands-on_ex04-funnelplot#computing-the-basic-derived-fields]